Student-teacher training with diverse decision tree ensembles
Student-teacher training allows a large teacher model or ensemble of teachers to be compressed into a single student model, for the purpose of efficient decoding. However, current approaches in automatic speech recognition assume that the state clusters, often defined by Phonetic Decision Trees (PDT), are the same across all models. This limits the diversity that can be captured within the ensemble, and also the flexibility when selecting the complexity of the student model output. This paper examines an extension to student-teacher training that allows for the possibility of having different PDTs between teachers, and also for the student to have a different PDT from the teacher. The proposal is to train the student to emulate the logical context-dependent state posteriors of the teacher, instead of the frame posteriors. This leads to a method of mapping frame posteriors from one PDT to another. This approach is evaluated on three speech recognition tasks: the Tok Pisin and Javanese low-resource conversational telephone speech tasks from the IARPA Babel programme, and the HUB4 English broadcast news task.
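As a rough illustration of the baseline setting this abstract describes, the sketch below shows conventional student-teacher training for one frame when every model shares the same state set: the teachers' posteriors are averaged and the student is penalised by the KL divergence from that target. It is a minimal sketch, not the paper's method. It does not implement the proposed mapping between different PDTs, and the function names and posterior values are hypothetical.

```python
import math

def ensemble_posterior(teacher_posteriors):
    """Average per-frame state posteriors over teachers.

    Assumes (as conventional approaches do) that all teachers share one
    state set, so index k means the same state in every model.
    """
    n_teachers = len(teacher_posteriors)
    n_states = len(teacher_posteriors[0])
    return [
        sum(p[k] for p in teacher_posteriors) / n_teachers
        for k in range(n_states)
    ]

def kl_divergence(target, student, eps=1e-12):
    """KL(target || student) for a single frame; eps guards against log(0)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target, student)
        if t > 0
    )

# Hypothetical posteriors over three states for one frame.
teachers = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
target = ensemble_posterior(teachers)

good_student = [0.6, 0.25, 0.15]   # matches the ensemble average
poor_student = [0.2, 0.3, 0.5]     # far from the ensemble average

loss_good = kl_divergence(target, good_student)
loss_poor = kl_divergence(target, poor_student)
```

In training, this per-frame loss would be summed over all frames and minimised with respect to the student's parameters; a student that matches the ensemble average incurs a loss close to zero.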
Environmentally robust ASR front-end for deep neural network acoustic models
This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant talking situations, where acoustic environmental distortion degrades the recognition performance. Training of a DNN-based acoustic model consists of generation of state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to the speech quality than the alignments and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00
Multi-basis adaptive neural network for rapid adaptation in speech recognition
Automatic speech recognition system development in the “wild”
The standard framework for developing an automatic speech recognition (ASR) system is to generate training and development data for building the system, and evaluation data for the final performance analysis. All the data is assumed to come from the domain of interest. Though this framework is matched to some tasks, it is more challenging for systems that are required to operate over broad domains, or where the ability to collect the required data is limited. This paper discusses ASR work performed under the IARPA MATERIAL program, which is aimed at cross-language information retrieval, and examines this challenging scenario. In terms of available data, only limited narrow-band conversational telephone speech data was provided. However, the system is required to operate over a range of domains, including broadcast data. As no data is available for the broadcast domain, this paper proposes an approach for system development based on scraping "related" data from the web, and using ASR system confidence scores as the primary metric for developing the acoustic and language model components. As an initial evaluation of the approach, the Swahili development language is used, with the final system performance assessed on the IARPA MATERIAL Analysis Pack 1 data. The Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Air Force Research Laboratory (AFRL
Annotating large lattices with the exact word error
The acoustic model in modern speech recognisers is trained discriminatively, for example with the minimum Bayes risk criterion. This criterion is hard to compute exactly, so it is normally approximated by a criterion that uses fixed alignments of lattice arcs. This approximation becomes particularly problematic with new types of acoustic models that require flexible alignments. It would be best to annotate lattices with the risk measure of interest, the exact word error. However, the algorithm for this uses finite-state automaton determinisation, which has exponential complexity and runs out of memory for large lattices. This paper introduces a novel method for determinising and minimising finite-state automata incrementally. Since it uses less memory, it can be applied to larger lattices. This work was supported by EPSRC Project EP/I006583/1 (Generative Kernels and Score Spaces for Classification of Speech) within the Global Uncertainties Programme and by a Google Research Award. This is the author accepted manuscript. The final version is available from ISCA via http://www.isca-speech.org/archive/interspeech_2015/i15_2625.htm
Paraphrastic language models and combination with neural network language models
In natural languages, multiple word sequences can represent the same underlying meaning. Only modelling the observed surface word sequence can result in poor context coverage, for example, when using n-gram language models (LM). To handle this issue, paraphrastic LMs were proposed in previous research and successfully applied to a US English conversational telephone speech transcription task. In order to exploit the complementary characteristics of paraphrastic LMs and neural network LMs (NNLM), the combination of the two is investigated in this paper. To investigate paraphrastic LMs’ generalization ability to other languages, experiments are conducted on a Mandarin Chinese broadcast speech transcription task. Using a paraphrastic multi-level LM modelling both word and phrase sequences, significant error rate reductions of 0.9% absolute (9% relative) and 0.5% absolute (5% relative) were obtained over the baseline n-gram and NNLM systems respectively, after a combination with word and phrase level NNLMs. The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This is the author accepted manuscript. The final version is available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6639308
A deep learning approach to assessing non-native pronunciation of English using phone distances
The way a non-native speaker pronounces the phones of a language is an important predictor of their proficiency. In grading spontaneous speech, the pairwise distances between generative statistical models trained on each phone have been shown to be powerful features. This paper presents a deep learning alternative to model-based phone distances in the form of a tunable Siamese network feature extractor to extract distance metrics directly from the audio frame sequence. Features are extracted at the phone instance level and combined into phone-level representations using an attention mechanism. Pairwise distances between phone features are then projected through a feed-forward layer to predict the score. The extraction stage is initialised either on a binary phone instance-pair classification task or to mimic the model-based features; the whole system is then fine-tuned end-to-end, optimising the learning of the distance metric for the score prediction task. This method is therefore more adaptable and more sensitive to phone instance level phenomena. Its performance is compared agains
Paraphrastic language models
Natural languages are known for their expressive richness. Many sentences can be used to represent the same underlying meaning. Only modelling the observed surface word sequence can result in poor context coverage and generalization, for example, when using n-gram language models (LMs). This paper proposes a novel form of language model, the paraphrastic LM, that addresses these issues. A phrase level paraphrase model statistically learned from standard text data with no semantic annotation is used to generate multiple paraphrase variants. LM probabilities are then estimated by maximizing their marginal probability. Multi-level language models estimated at both the word level and the phrase level are combined. An efficient weighted finite state transducer (WFST) based paraphrase generation approach is also presented. Significant error rate reductions of 0.5–0.6% absolute were obtained over the baseline n-gram LMs on two state-of-the-art recognition tasks for English conversational telephone speech and Mandarin Chinese broadcast speech using a paraphrastic multi-level LM modelling both word and phrase sequences. When it is further combined with word and phrase level feed-forward neural network LMs, significant error rate reductions of 0.9% absolute (9% relative) and 0.5% absolute (5% relative) were obtained over the baseline n-gram and neural network LMs respectively. The research leading to these results was supported by EPSRC grant EP/I031022/1 (Natural Speech Technology) and DARPA under the Broad Operational Language Translation (BOLT) program. This version is the author accepted manuscript. The final published version can be found on the publisher's website at: http://www.sciencedirect.com/science/article/pii/S088523081400028X# © 2014 Elsevier Ltd. All rights reserved
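The abstract above quotes error rate reductions in both absolute and relative terms. The sketch below shows how the two figures relate. The abstract does not report the baseline error rates, so the rates used here are hypothetical values chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
def absolute_reduction(baseline_rate, new_rate):
    """Difference in error rate, in percentage points."""
    return baseline_rate - new_rate

def relative_reduction(baseline_rate, new_rate):
    """Reduction expressed as a percentage of the baseline error rate."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate

# Hypothetical error rates in percent; NOT taken from the paper.
baseline = 10.0
improved = 9.1

abs_gain = absolute_reduction(baseline, improved)
rel_gain = relative_reduction(baseline, improved)

print(f"absolute: {abs_gain:.1f}%  relative: {rel_gain:.1f}%")
```

A given absolute gain corresponds to a larger relative gain when the baseline error rate is lower, which is why abstracts often report both figures.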